A Component-wise EM Algorithm for Mixtures 



Abstract: Estimation in finite mixture distributions is typically an incomplete data structure problem for
which the Expectation-Maximization (EM) algorithm can be used. One drawback of the algorithm is its
slow convergence in some situations. In the mixture case, little progress in speeding up EM has been
made. Standard EM procedures update all parameters simultaneously. In the missing data context, it
has been shown that sequential updating can lead to faster convergence. In this paper we propose a
component-wise EM for mixtures, which updates the parameters sequentially. It intrinsically decouples
the parameter updates so that the estimated proportions may not sum to 1 during an iteration. While
maintaining monotone convergence, the algorithm may leave the parameter space but is guaranteed to
return to it upon convergence. We give an interpretation of this procedure as a proximal point algorithm
and use it to prove convergence. Illustrative numerical experiments show how our algorithm compares to
EM and a version of the SAGE algorithm, and suggest that it is most useful when EM converges slowly.

Key-words: EM algorithm, Kullback-Leibler divergence, Mixture estimation, Proximal point algorithm, 
SAGE algorithm. 



1 Introduction 

Estimation in finite mixture distributions is typically an incomplete data structure problem for which the 
EM algorithm is used (see for instance Dempster, Laird and Rubin 1977 and Redner and Walker 1984). 
The most documented problem occurring with the EM algorithm is its possibly slow convergence in some situations.
Many papers have proposed extensions of the EM algorithm based on standard numerical tools to speed
up the convergence. Possible references include Louis (1982), Lewitt and Muehllehner (1986), Kaufman
(1987), Meilijson (1989), and Jamshidian and Jennrich (1993). They are often effective, but they do
not guarantee monotone increase in the objective function. To overcome this problem, alternatives based
on model reduction and efficient data augmentation have recently been considered. As regards model 
reduction, we refer to Meng and Rubin (1993), Liu and Rubin (1994). For data augmentation, see Fessler 
and Hero (1994, 1995), Hero and Fessler (1995), Meng and van Dyk (1997, 1998), Neal and Hinton (1998), 
Liu, Rubin and Wu (1998); see also Chapter 5 of McLachlan (1997). These extensions share the
simplicity and stability of EM while speeding up the convergence. However, as far as we know, only
two extensions have been devoted to speeding up the convergence in the mixture case, which is one of the
most important domains of application for EM (Pilla and Lindsay 1996, Liu and Sun 1997). The first, by
Pilla and Lindsay (1996), is based on a restricted efficient data augmentation scheme for the estimation
of the proportions for known discrete distributions. The second, by Liu and Sun (1997), is
concerned with the implementation of the ECME algorithm (Liu and Rubin 1994) for mixture distributions.
In this paper we propose, study and illustrate a component-wise EM algorithm (CEMM: Component-wise 
EM algorithm for Mixtures) aiming at overcoming the slow convergence problem in the finite mixture 
context. Our approach is based on recent work of Chretien and Hero (1998a, b, 1999), which recasts the
EM procedure in the framework of proximal point algorithms (Rockafellar 1976a, Teboulle 1997).
In Section 2 we present the EM algorithm for mixtures. In Section 3 we describe our component-wise
algorithm and show that it can be interpreted as a proximal point algorithm. Using this interpretation,
convergence of CEMM is proved in the same section. Illustrative numerical experiments comparing the
behaviors of EM, a version of the SAGE algorithm (Fessler and Hero 1994, Fessler and Hero 1995) and
CEMM are presented in Section 4. Concluding remarks end the paper. An appendix carefully describes the
SAGE method in the mixture context in order to provide a detailed comparison with the proposed CEMM.



2 The EM algorithm for mixtures 



We consider a $J$-component mixture in $\mathbb{R}^d$

$$g(y \mid \theta) = \sum_{j=1}^{J} p_j \, \varphi(y \mid \alpha_j) \tag{1}$$

where the $p_j$'s ($0 < p_j < 1$ and $\sum_{j=1}^{J} p_j = 1$) are the mixing proportions and where
$\varphi(y \mid \alpha)$ is a density function parametrized by $\alpha$. The vector parameter to be estimated is
$\theta = (p_1, \ldots, p_J, \alpha_1, \ldots, \alpha_J)$. The parametric families of mixture densities are
assumed to be identifiable. This means that for any two members of the form (1),

$$g(y \mid \theta) = g(y \mid \theta')$$

if and only if $J = J'$ and we can permute the component labels so that $p_j = p_{j'}$ and
$\varphi(y \mid \alpha_j) = \varphi(y \mid \alpha_{j'})$, for $j = 1, \ldots, J$. Most mixtures of interest are
identifiable (see for instance Redner and Walker 1984). For the sake of simplicity, we restrict the present
analysis to Gaussian mixtures, but extension to more general mixtures is straightforward. Thus,
$\varphi(y \mid \mu, \Sigma)$ denotes the density of a Gaussian distribution with mean $\mu$ and variance
matrix $\Sigma$. The parameter to be estimated is

$$\theta = (p_1, \ldots, p_J, \mu_1, \ldots, \mu_J, \Sigma_1, \ldots, \Sigma_J).$$

In the following, we denote $\theta_j = (p_j, \mu_j, \Sigma_j)$, for $j = 1, \ldots, J$. We also denote by
$\Theta$ the parameter space $\{(p_1, \ldots, p_J, \mu_1, \ldots, \mu_J, \Sigma_1, \ldots, \Sigma_J)\}$ and by
$\Theta'$ the affine submanifold

$$\Theta' = \Big\{ \theta \in \Theta \;\Big|\; \sum_{\ell=1}^{J} p_\ell = 1 \Big\}.$$

The mixture density estimation problem is typically a missing data problem for which the EM algorithm
appears to be useful.

Let $\mathbf{y} = (y_1, \ldots, y_n) \in \mathbb{R}^{dn}$ be an observed sample from the mixture distribution
$g(y \mid \theta)$. We assume that the component from which each $y_i$ arose is unknown, so that the missing
data are the labels $z_i$, $i = 1, \ldots, n$. We have $z_i = j$ if and only if $j$ is the mixture component
from which $y_i$ arises. Let $\mathbf{z} = (z_1, \ldots, z_n)$ denote the missing data, $\mathbf{z} \in B^n$,
where $B = \{1, \ldots, J\}$. The complete sample is $\mathbf{x} = (x_1, \ldots, x_n)$ with $x_i = (y_i, z_i)$
and we have $\mathbf{x} = (\mathbf{y}, \mathbf{z})$. The observed log-likelihood is

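$$L(\theta \mid \mathbf{y}) = \log g(\mathbf{y} \mid \theta),$$

where $g(\mathbf{y} \mid \theta)$ denotes the density of the observed sample $\mathbf{y}$. Using (1) leads to

$$L(\theta \mid \mathbf{y}) = \sum_{i=1}^{n} \log \Big\{ \sum_{j=1}^{J} p_j \, \varphi(y_i \mid \mu_j, \Sigma_j) \Big\}.$$

The complete log-likelihood is

$$L(\theta \mid \mathbf{x}) = \log f(\mathbf{x} \mid \theta),$$

where $f(\mathbf{x} \mid \theta)$ denotes the density of the complete sample $\mathbf{x}$. We have

$$L(\theta \mid \mathbf{x}) = \sum_{i=1}^{n} \big\{ \log p_{z_i} + \log \varphi(y_i \mid \mu_{z_i}, \Sigma_{z_i}) \big\}. \tag{2}$$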



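The conditional density function of the complete data given $\mathbf{y}$,

$$t(\mathbf{x} \mid \mathbf{y}, \theta) = \frac{f(\mathbf{x} \mid \theta)}{g(\mathbf{y} \mid \theta)}, \tag{3}$$

takes the form

$$t(\mathbf{x} \mid \mathbf{y}, \theta) = \prod_{i=1}^{n} t_{i z_i}(\theta), \tag{4}$$

where $t_{ij}(\theta)$, $j = 1, \ldots, J$, denotes the conditional probability, given $\mathbf{y}$, that $y_i$
arises from the mixture component with density $\varphi(\cdot \mid \mu_j, \Sigma_j)$. From Bayes' formula, we
have for each $i$ ($1 \le i \le n$) and $j$ ($1 \le j \le J$)

$$t_{ij}(\theta) = \frac{p_j \, \varphi(y_i \mid \mu_j, \Sigma_j)}{\sum_{\ell=1}^{J} p_\ell \, \varphi(y_i \mid \mu_\ell, \Sigma_\ell)}. \tag{5}$$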
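
As an illustration of formula (5), the following minimal sketch computes the conditional probabilities for a univariate Gaussian mixture. The function names, the toy data and the parameter values are assumptions made for this example only; they are not taken from the paper.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(ys, props, means, variances):
    """Return the table t[i][j] of equation (5) for a univariate Gaussian mixture."""
    table = []
    for y in ys:
        # Numerators p_j * phi(y_i | mu_j, sigma_j^2) for every component j.
        weighted = [p * normal_pdf(y, m, v)
                    for p, m, v in zip(props, means, variances)]
        total = sum(weighted)  # mixture density g(y_i | theta)
        table.append([w / total for w in weighted])
    return table


# Hypothetical toy data and parameters, chosen only for illustration.
ys = [-0.2, 0.1, 2.9, 3.2, 6.1]
t = responsibilities(ys, [1 / 3, 1 / 3, 1 / 3], [0.0, 3.0, 6.0], [1.0, 1.0, 1.0])
for row in t:
    print([round(value, 3) for value in row])
```

Each row of the table sums to 1, and observations close to a component mean are attributed almost entirely to that component.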

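Thus the conditional expectation of the complete log-likelihood given $\mathbf{y}$ and a previous estimate
of $\theta$, denoted $\theta'$,

$$Q(\theta \mid \theta') = E\big[\log f(\mathbf{x} \mid \theta) \mid \mathbf{y}, \theta'\big],$$

takes the form

$$Q(\theta \mid \theta') = \sum_{i=1}^{n} \sum_{j=1}^{J} t_{ij}(\theta') \big\{ \log p_j + \log \varphi(y_i \mid \mu_j, \Sigma_j) \big\}. \tag{6}$$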

The EM algorithm generates a sequence of approximations to derive the maximum observed likelihood
estimator starting from an initial guess $\theta^0$, using two steps. The $k$th iteration is as follows.

E-step: Compute $Q(\theta \mid \theta^k) = E\big[\log f(\mathbf{x} \mid \theta) \mid \mathbf{y}, \theta^k\big]$.
M-step: Find $\theta^{k+1} = \arg\max_{\theta \in \Theta} Q(\theta \mid \theta^k)$.

In many situations, including the mixture case, the explicit computation of $Q(\theta \mid \theta^k)$ in the
E-step is unnecessary and this step reduces to the computation of the conditional density
$t(\mathbf{x} \mid \mathbf{y}, \theta^k)$. For Gaussian mixtures, these two steps take the form



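E-step: For $i = 1, \ldots, n$ and $j = 1, \ldots, J$ compute

$$t_{ij}(\theta^k) = \frac{p_j^k \, \varphi(y_i \mid \mu_j^k, \Sigma_j^k)}{\sum_{\ell=1}^{J} p_\ell^k \, \varphi(y_i \mid \mu_\ell^k, \Sigma_\ell^k)}. \tag{7}$$

M-step: Set $\theta^{k+1} = (p_1^{k+1}, \ldots, p_J^{k+1}, \mu_1^{k+1}, \ldots, \mu_J^{k+1}, \Sigma_1^{k+1}, \ldots, \Sigma_J^{k+1})$ with

$$\begin{aligned}
p_j^{k+1} &= \frac{1}{n} \sum_{i=1}^{n} t_{ij}(\theta^k), \\
\mu_j^{k+1} &= \frac{\sum_{i=1}^{n} t_{ij}(\theta^k) \, y_i}{\sum_{i=1}^{n} t_{ij}(\theta^k)}, \\
\Sigma_j^{k+1} &= \frac{\sum_{i=1}^{n} t_{ij}(\theta^k) \, (y_i - \mu_j^{k+1})(y_i - \mu_j^{k+1})^T}{\sum_{i=1}^{n} t_{ij}(\theta^k)}.
\end{aligned} \tag{8}$$

Note that at each iteration, the following properties hold:

$$\text{for } i = 1, \ldots, n, \quad \sum_{j=1}^{J} t_{ij}(\theta^k) = 1 \quad \text{and} \quad \sum_{j=1}^{J} p_j^k = 1. \tag{9}$$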
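
The two steps (7) and (8) can be sketched as follows for the univariate case, where each variance matrix reduces to a scalar variance. This is only an illustrative sketch: the helper names, the toy data and the starting values are assumptions and do not come from the paper.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(ys, props, means, variances):
    """Observed log-likelihood L(theta | y) of a univariate Gaussian mixture."""
    return sum(math.log(sum(p * normal_pdf(y, m, v)
                            for p, m, v in zip(props, means, variances)))
               for y in ys)


def em_step(ys, props, means, variances):
    """One EM iteration: E-step (7) followed by M-step (8)."""
    n, J = len(ys), len(props)
    # E-step: conditional probabilities t[i][j] for all i and j.
    t = []
    for y in ys:
        w = [props[j] * normal_pdf(y, means[j], variances[j]) for j in range(J)]
        total = sum(w)
        t.append([wj / total for wj in w])
    # M-step: all components are updated simultaneously.
    new_props, new_means, new_vars = [], [], []
    for j in range(J):
        mass = sum(t[i][j] for i in range(n))
        mean = sum(t[i][j] * ys[i] for i in range(n)) / mass
        var = sum(t[i][j] * (ys[i] - mean) ** 2 for i in range(n)) / mass
        new_props.append(mass / n)
        new_means.append(mean)
        new_vars.append(var)
    return new_props, new_means, new_vars


# Hypothetical toy sample and initial position.
ys = [-1.1, -0.3, 0.2, 0.9, 2.4, 3.1, 3.6, 5.2, 5.9, 6.8]
theta = ([1 / 3, 1 / 3, 1 / 3], [0.0, 3.0, 6.0], [1.0, 1.0, 1.0])
history = [log_likelihood(ys, *theta)]
for _ in range(20):
    theta = em_step(ys, *theta)
    history.append(log_likelihood(ys, *theta))
print(history[0], history[-1], sum(theta[0]))
```

In this sketch the recorded log-likelihood values should never decrease and the proportions sum to 1 after every iteration, in line with (9).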

3 A Component-wise EM for mixtures

Serial decomposition of optimization methods is a well known procedure in numerical analysis. Assuming
that $\theta$ lies in $\mathbb{R}^p$, the optimization problem

$$\max_{\theta \in \mathbb{R}^p} \Phi(\theta)$$

is decomposed into a series of coordinate-wise maximization problems of the form

$$\max_{\eta \in \mathbb{R}} \Phi(\theta_1, \ldots, \theta_{j-1}, \eta, \theta_{j+1}, \ldots, \theta_p).$$

This procedure is called a Gauss-Seidel scheme. The study of this method is standard (see Ciarlet 1988
for example). It has the advantage of using new information as soon as it is available rather than
waiting until all parameters have been updated. One of the most promising general purpose extensions of
EM going in this direction is the Space-Alternating Generalized EM (SAGE) algorithm of Fessler and
Hero (1994). Improved convergence rates are reached by updating the parameters sequentially in small
groups associated with small missing data spaces rather than one large complete data space. The idea is that
less informative missing data spaces lead to smaller root-convergence factors and hence faster converging
algorithms. A general description and details concerning the rationale, the properties and illustrations of
the SAGE algorithm can be found in Fessler and Hero (1994, 1995) and Hero and Fessler (1995). The CEMM
algorithm is closely related to the SAGE approach. For comparison purposes, we described in the appendix
of Celeux et al. (1999) a version of SAGE for Gaussian mixtures. This version is nearly a component-wise
algorithm except that the mixing proportions need to be updated in the same iteration, which involves
the whole complete data structure. For this reason, it may not be significantly faster than the standard
EM algorithm. This points out the main interest of the component-wise EM algorithm that we propose
for mixtures: no iteration needs the whole complete data space as missing data space. It can therefore be
expected to converge faster in various situations.
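
The coordinate-wise (Gauss-Seidel) decomposition described at the start of this section can be illustrated on a toy problem. The sketch below assumes a strictly concave quadratic objective $\Phi(\theta) = -\tfrac{1}{2}\theta^T A \theta + b^T \theta$, for which each one-dimensional maximization has a closed form; the objective, the matrix `A`, the vector `b` and the function name are all hypothetical choices made for illustration, not part of the paper.

```python
def coordinate_ascent(A, b, theta, sweeps=50):
    """Maximize Phi(theta) = -0.5 * theta^T A theta + b^T theta coordinate by coordinate.

    A is assumed symmetric positive definite, so Phi is strictly concave and
    each one-dimensional maximization has the closed form used below.
    """
    theta = list(theta)
    p = len(theta)
    for _ in range(sweeps):
        for j in range(p):
            # Solve dPhi/dtheta_j = 0 using the most recent values of the
            # other coordinates (the Gauss-Seidel idea).
            others = sum(A[j][l] * theta[l] for l in range(p) if l != j)
            theta[j] = (b[j] - others) / A[j][j]
    return theta


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
theta_hat = coordinate_ascent(A, b, [0.0, 0.0])
print(theta_hat)  # close to the exact maximizer (1/11, 7/11)
```

Each coordinate update immediately uses the latest value of the other coordinate, which is the feature that CEMM transfers to the components of a mixture.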

3.1 The CEMM algorithm 

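Our Component-wise EM algorithm for Mixtures (CEMM) considers the decomposition of the parameter
vector $\theta = (\theta_j, j = 1, \ldots, J)$ with $\theta_j = (p_j, \mu_j, \Sigma_j)$. The idea is to update
only one component at a time, leaving the other parameters unchanged. The order in which the components
are visited may be arbitrary, prescribed or varying adaptively. For simplicity, in our presentation, the
components are updated successively in the order $j = 1, \ldots, J$, and this is repeated after $J$
iterations. Therefore the component updated at iteration $k$ is given by (10) and the $k$th iteration of the
algorithm is as follows. For

$$j = k - \lfloor k/J \rfloor J + 1, \tag{10}$$

$\lfloor \cdot \rfloor$ denoting the integer part, it alternates the following steps.

E-step: Compute for $i = 1, \ldots, n$,

$$t_{ij}(\theta^k) = \frac{p_j^k \, \varphi(y_i \mid \mu_j^k, \Sigma_j^k)}{\sum_{\ell=1}^{J} p_\ell^k \, \varphi(y_i \mid \mu_\ell^k, \Sigma_\ell^k)}. \tag{11}$$

M-step: Set

$$\begin{aligned}
p_j^{k+1} &= \frac{1}{n} \sum_{i=1}^{n} t_{ij}(\theta^k), \\
\mu_j^{k+1} &= \frac{\sum_{i=1}^{n} t_{ij}(\theta^k) \, y_i}{\sum_{i=1}^{n} t_{ij}(\theta^k)}, \\
\Sigma_j^{k+1} &= \frac{\sum_{i=1}^{n} t_{ij}(\theta^k) \, (y_i - \mu_j^{k+1})(y_i - \mu_j^{k+1})^T}{\sum_{i=1}^{n} t_{ij}(\theta^k)},
\end{aligned} \tag{12}$$

and for $\ell \neq j$, $\theta_\ell^{k+1} = \theta_\ell^k$.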



Note that the main difference with the SAGE algorithm presented in Celeux et al. (1999) is that the
updating steps of the mixing proportions cannot be regarded as maximization steps as in SAGE. Also, in
CEMM, the $p_j$'s in (12) do not necessarily sum to 1, so that the algorithm may leave the parameter space.
Consequently, the standard SAGE assumptions are not satisfied and a specific convergence analysis must
be carried out. It shows that the CEMM algorithm is guaranteed to return to the parameter space upon
convergence. It is based on the proximal interpretation of CEMM given in the next subsection.
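
The iteration (10)-(12) can be sketched as follows for univariate Gaussian mixtures. This is a minimal illustration under stated assumptions: the function names, the 0-based indexing of components, the toy data and the starting values are not from the paper.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def cemm_iteration(ys, props, means, variances, k):
    """Iteration k of CEMM for a univariate Gaussian mixture (k starts at 0)."""
    n, J = len(ys), len(props)
    j = k % J  # 0-based counterpart of j = k - floor(k/J) * J + 1 in (10)
    # E-step (11): conditional probabilities for component j only.
    t_j = []
    for y in ys:
        weighted = [props[l] * normal_pdf(y, means[l], variances[l]) for l in range(J)]
        t_j.append(weighted[j] / sum(weighted))
    # M-step (12): update component j and leave the others untouched.
    mass = sum(t_j)
    mean = sum(t * y for t, y in zip(t_j, ys)) / mass
    var = sum(t * (y - mean) ** 2 for t, y in zip(t_j, ys)) / mass
    props, means, variances = list(props), list(means), list(variances)
    props[j], means[j], variances[j] = mass / n, mean, var
    return props, means, variances


# Hypothetical toy sample and initial position.
ys = [-1.1, -0.3, 0.2, 0.9, 2.4, 3.1, 3.6, 5.2, 5.9, 6.8]
theta = ([1 / 3, 1 / 3, 1 / 3], [0.0, 3.0, 6.0], [1.0, 1.0, 1.0])
gaps = []  # |sum_j p_j - 1| after each iteration
for k in range(300):
    theta = cemm_iteration(ys, *theta, k=k)
    gaps.append(abs(sum(theta[0]) - 1.0))
print(gaps[0], gaps[-1])
```

On this toy sample the proportions no longer sum to 1 after the first iteration, while the gap shrinks as the iterations proceed, which is the behavior the convergence analysis is meant to explain.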

3.2 Lagrangian and Proximal representation of CEMM

As shown in Chretien and Hero (1998a), the EM procedure can be recast into a proximal point framework.
This point of view provides much insight into the convergence properties of the algorithm (see Chretien
and Hero 1999). The proximal point algorithm was first studied in Rockafellar (1976b). The proximal
methodology was then applied to many types of algorithms and remains an active research area (see
Teboulle 1992, 1997 for instance and the literature therein). Considering the general problem of maximizing
a concave function $\Phi(\theta)$ on $\mathbb{R}^p$, the proximal point algorithm is an iterative procedure
defined by the recurrence

$$\theta^{k+1} = \arg\max_{\theta \in \mathbb{R}^p} \Big\{ \Phi(\theta) - \frac{1}{2} \|\theta - \theta^k\|^2 \Big\}. \tag{13}$$

In other words, the objective function $\Phi$ is regularized using a quadratic penalty
$\|\theta - \theta^k\|^2$. The EM algorithm can be viewed as a generalized proximal point algorithm where the
quadratic regularization is replaced by a Kullback-type penalty. Note that this presentation includes the
interpretation of EM as an alternating optimization algorithm (Neal and Hinton 1998, Hathaway 1986 in the
mixture context). Equation (13) becomes

$$\theta^{k+1} = \arg\max_{\theta \in \Theta} \big\{ L(\theta \mid \mathbf{y}) - D(\theta, \theta^k \mid \mathbf{y}) \big\}, \tag{14}$$

where $L(\theta \mid \mathbf{y})$ is the observed log-likelihood of Section 2. The penalty term
$D(\theta, \theta^k \mid \mathbf{y})$ is the Kullback-Leibler divergence $I$ between the two conditional
densities $t(\mathbf{x} \mid \mathbf{y}, \theta)$ and $t(\mathbf{x} \mid \mathbf{y}, \theta^k)$ as defined in (3),

$$D(\theta, \theta' \mid \mathbf{y}) = I\big(t(\mathbf{x} \mid \mathbf{y}, \theta'), t(\mathbf{x} \mid \mathbf{y}, \theta)\big) = E\Big[ \log \frac{t(\mathbf{x} \mid \mathbf{y}, \theta')}{t(\mathbf{x} \mid \mathbf{y}, \theta)} \,\Big|\, \mathbf{y}; \theta' \Big]. \tag{15}$$
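
The link between the EM step and the penalized form (14) can be made explicit with a short derivation. It only uses the definition (3) of the conditional density and the definition of $Q$ from Section 2; it is given here as a reading aid rather than as part of the original argument.

```latex
\begin{align*}
\log f(\mathbf{x} \mid \theta)
  &= L(\theta \mid \mathbf{y}) + \log t(\mathbf{x} \mid \mathbf{y}, \theta)
  && \text{by (3)} \\
Q(\theta \mid \theta^k)
  &= L(\theta \mid \mathbf{y})
     + E\big[\log t(\mathbf{x} \mid \mathbf{y}, \theta) \mid \mathbf{y}, \theta^k\big] \\
  &= L(\theta \mid \mathbf{y}) - D(\theta, \theta^k \mid \mathbf{y})
     + E\big[\log t(\mathbf{x} \mid \mathbf{y}, \theta^k) \mid \mathbf{y}, \theta^k\big]
  && \text{by (15)}
\end{align*}
```

The last expectation does not depend on $\theta$, so maximizing $Q(\theta \mid \theta^k)$ over $\theta$ is the same as maximizing $L(\theta \mid \mathbf{y}) - D(\theta, \theta^k \mid \mathbf{y})$.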

This quantity is well defined under mild regularity assumptions on the parametrized conditional
densities $t(\mathbf{x} \mid \mathbf{y}, \theta)$ with $\theta \in \Theta$ (see Celeux, Chretien, Forbes and
Mkhadri 1999 for details). A question of importance is whether or not the following property holds,

$$D(\theta, \theta' \mid \mathbf{y}) = 0 \;\Longrightarrow\; \theta' = \theta. \tag{16}$$

Since the Kullback-Leibler divergence is strictly convex, nonnegative and is zero between identical
distributions, $D$ vanishes iff $t(\theta') = t(\theta)$. However, the operator defined by $t(\cdot)$ is not
injective on the whole parameter space. Therefore, (16) does not generally hold and the Kullback information
does not a priori behave like a distance in all directions of the parameter space. However, in the mixture
case, (16) holds when the constraint $\sum_{\ell=1}^{J} p_\ell = 1$ is satisfied. In addition, we proved in
Celeux et al. (1999) that $t(\cdot)$ is coordinate-wise injective, which allows the Kullback measure to enjoy
some distance-like properties at least on coordinate subspaces. More specifically, we proved (see Lemma 1 in
Celeux et al. 1999) that for any $v$ in $\{1, \ldots, J\}$ the operator
$t(\theta_1, \ldots, \theta_{v-1}, \cdot, \theta_{v+1}, \ldots, \theta_J)$ is injective. The result below
follows easily.

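Lemma 1 The distance-like function $D(\theta, \theta' \mid \mathbf{y})$ satisfies the following properties:

(i) $D(\theta, \theta' \mid \mathbf{y}) \ge 0$ for all $\theta'$ and $\theta$ in $\Theta$,

(ii) if $\theta$ and $\theta'$ only differ in one coordinate, $D(\theta, \theta' \mid \mathbf{y}) = 0$ implies $\theta' = \theta$.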

This result is essential in proving convergence properties (see Subsection 3.3) of the CEMM algorithm.
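
For mixtures, the conditional density factorizes as in (4), so the penalty (15) reduces to the finite sum $\sum_i \sum_j t_{ij}(\theta') \log\{t_{ij}(\theta') / t_{ij}(\theta)\}$. The sketch below evaluates this sum for univariate Gaussian mixtures and illustrates both the lack of injectivity of $t(\cdot)$ on the whole parameter space and the coordinate-wise behavior stated in the lemma. Function names, data and parameter values are assumptions made for the example.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(ys, theta):
    """Table t[i][j] of conditional probabilities for theta = (props, means, variances)."""
    props, means, variances = theta
    table = []
    for y in ys:
        weighted = [p * normal_pdf(y, m, v) for p, m, v in zip(props, means, variances)]
        total = sum(weighted)
        table.append([w / total for w in weighted])
    return table


def kl_penalty(ys, theta, theta_ref):
    """D(theta, theta_ref | y) written as a sum over observations and components."""
    t_ref = responsibilities(ys, theta_ref)
    t_new = responsibilities(ys, theta)
    total = 0.0
    for row_ref, row_new in zip(t_ref, t_new):
        for a, b in zip(row_ref, row_new):
            if a > 0.0:  # terms with zero reference probability contribute nothing
                total += a * math.log(a / b)
    return total


# Hypothetical toy sample and parameter values.
ys = [-1.1, -0.3, 0.2, 0.9, 2.4, 3.1, 3.6, 5.2, 5.9, 6.8]
theta_ref = ([1 / 3, 1 / 3, 1 / 3], [0.0, 3.0, 6.0], [1.0, 1.0, 1.0])
# Every proportion doubled: outside the constraint set, yet the conditional
# probabilities are unchanged, so the penalty cannot detect the difference.
theta_scaled = ([2 / 3, 2 / 3, 2 / 3], [0.0, 3.0, 6.0], [1.0, 1.0, 1.0])
# Only the first component differs from theta_ref.
theta_moved = ([1 / 3, 1 / 3, 1 / 3], [0.5, 3.0, 6.0], [1.0, 1.0, 1.0])

d_same = kl_penalty(ys, theta_ref, theta_ref)
d_scaled = kl_penalty(ys, theta_scaled, theta_ref)
d_moved = kl_penalty(ys, theta_moved, theta_ref)
print(d_same, d_scaled, d_moved)
```

Rescaling all proportions leaves the penalty at zero although the parameters differ, whereas changing a single component yields a strictly positive value.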

The main difficulty in passing to a component-wise approach is the treatment of the constraint

$$\sum_{\ell=1}^{J} p_\ell = 1.$$

Usually, a reduced parameter space is introduced,

$$\{(p_1, \ldots, p_{J-1}, \mu_1, \ldots, \mu_J, \Sigma_1, \ldots, \Sigma_J)\}, \tag{17}$$

the remaining proportion being deduced from the equality

$$p_J = 1 - \sum_{\ell=1}^{J-1} p_\ell,$$

see Redner and Walker (1984) for instance. This is obviously not satisfactory in the context of
coordinate-wise methods. A Lagrangian approach (ref ??) seems more appropriate. It first provides an
alternative interpretation of the EM algorithm for mixtures, where the sample size $n$ is nothing but the
Lagrange multiplier associated with the proportion constraint. The EM algorithm for mixtures is equivalent
to the following iteration,

$$\theta^{k+1} = \arg\max_{\theta \in \Theta} \Big\{ Q(\theta \mid \theta^k) - n \Big( \sum_{\ell=1}^{J} p_\ell - 1 \Big) \Big\}. \tag{18}$$
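
The role of $n$ as a multiplier can be checked directly from the expression (6) of $Q$ and the normalization in (9). The following short derivation is a reading aid, not part of the original text.

```latex
\begin{align*}
\frac{\partial}{\partial p_j}
  \Big[ Q(\theta \mid \theta^k) - n \Big( \sum_{\ell=1}^{J} p_\ell - 1 \Big) \Big]
  &= \frac{1}{p_j} \sum_{i=1}^{n} t_{ij}(\theta^k) - n = 0
  \;\Longrightarrow\;
  p_j^{k+1} = \frac{1}{n} \sum_{i=1}^{n} t_{ij}(\theta^k), \\
\sum_{j=1}^{J} p_j^{k+1}
  &= \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{J} t_{ij}(\theta^k) = 1 .
\end{align*}
```

When all $J$ proportions are updated from the same $t_{ij}(\theta^k)$, as in EM, the constraint is therefore satisfied automatically. When only one proportion is updated at a time, the same formula is used but the sum over $j$ is no longer guaranteed to equal 1.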



Then, looking at the maximization steps (12) and (8) and using formulation (18) for EM, we can easily
deduce the proximal representation of CEMM. We refer to Celeux et al. (1999) for a proof.

Proposition 1 The CEMM recursion is equivalent to a coordinate-wise generalized proximal point procedure
of the type

$$\theta^{k+1} = \arg\max_{\theta \in \Theta_k} \Big\{ L(\theta \mid \mathbf{y}) - n \Big( \sum_{\ell=1}^{J} p_\ell - 1 \Big) - D(\theta, \theta^k \mid \mathbf{y}) \Big\},$$

where $\Theta_k$ is the parameter set of the form

$$\Theta_k = \{ \theta \in \Theta \mid \theta_\ell = \theta_\ell^k, \ \ell \neq j \}$$

with $j = k - \lfloor k/J \rfloor J + 1$.

We now establish a series of results concerning the CEMM iterations. 
3.3 Convergence of CEMM 

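Assumption 1 Let $\theta$ be any point in $\Theta$. Then, the level set

$$\mathcal{L}_\theta = \{ \theta' \mid L(\theta' \mid \mathbf{y}) \ge L(\theta \mid \mathbf{y}) \}$$

is compact.

Let $\Lambda(\theta \mid \mathbf{y})$ be the modified log-likelihood function given by

$$\Lambda(\theta \mid \mathbf{y}) = L(\theta \mid \mathbf{y}) - n \Big( \sum_{\ell=1}^{J} p_\ell - 1 \Big). \tag{19}$$

This function first arose in the Lagrangian framework of Section 3.2. It appears in the right-hand side of
the recursion of Proposition 1 when the Kullback-type penalty is omitted.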



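Proposition 2 The sequence $\{\Lambda(\theta^k \mid \mathbf{y})\}_{k \in \mathbb{N}}$ is monotone non-decreasing, and satisfies

$$\Lambda(\theta^{k+1} \mid \mathbf{y}) - \Lambda(\theta^k \mid \mathbf{y}) \ge D(\theta^{k+1}, \theta^k \mid \mathbf{y}). \tag{20}$$

Proof. From the recursion of Proposition 1, we have

$$\Lambda(\theta^{k+1} \mid \mathbf{y}) - \Lambda(\theta^k \mid \mathbf{y}) \ge D(\theta^{k+1}, \theta^k \mid \mathbf{y}) - D(\theta^k, \theta^k \mid \mathbf{y}).$$

The proposition follows from $D(\theta^{k+1}, \theta^k \mid \mathbf{y}) \ge 0$ and $D(\theta^k, \theta^k \mid \mathbf{y}) = 0$. □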
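
The monotonicity stated in the proposition can be observed numerically. The sketch below runs component-wise iterations on a univariate Gaussian mixture and records the modified log-likelihood (19) after each one. Function names, toy data and starting values are assumptions made for this illustration.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def modified_loglik(ys, props, means, variances):
    """Lambda(theta | y) = L(theta | y) - n * (sum_j p_j - 1), as in (19)."""
    loglik = sum(math.log(sum(p * normal_pdf(y, m, v)
                              for p, m, v in zip(props, means, variances)))
                 for y in ys)
    return loglik - len(ys) * (sum(props) - 1.0)


def cemm_iteration(ys, props, means, variances, k):
    """Update only component k mod J, following the E-step (11) and M-step (12)."""
    n, J = len(ys), len(props)
    j = k % J
    t_j = []
    for y in ys:
        weighted = [props[l] * normal_pdf(y, means[l], variances[l]) for l in range(J)]
        t_j.append(weighted[j] / sum(weighted))
    mass = sum(t_j)
    mean = sum(t * y for t, y in zip(t_j, ys)) / mass
    var = sum(t * (y - mean) ** 2 for t, y in zip(t_j, ys)) / mass
    props, means, variances = list(props), list(means), list(variances)
    props[j], means[j], variances[j] = mass / n, mean, var
    return props, means, variances


# Hypothetical toy sample and initial position.
ys = [-1.1, -0.3, 0.2, 0.9, 2.4, 3.1, 3.6, 5.2, 5.9, 6.8]
theta = ([1 / 3, 1 / 3, 1 / 3], [0.0, 3.0, 6.0], [1.0, 1.0, 1.0])
values = [modified_loglik(ys, *theta)]
for k in range(60):
    theta = cemm_iteration(ys, *theta, k=k)
    values.append(modified_loglik(ys, *theta))
increments = [b - a for a, b in zip(values, values[1:])]
print(min(increments))
```

Up to floating-point rounding, every increment should be nonnegative, even at iterations where the proportions do not sum to 1.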

Lemma 2 The sequence $\{\theta^k\}_{k \in \mathbb{N}}$ is bounded and satisfies

$$\lim_{k \to \infty} \sum_{\ell=1}^{J} p_\ell^k = 1. \tag{21}$$

If in addition, $\{\Lambda(\theta^k \mid \mathbf{y})\}_{k \in \mathbb{N}}$ is bounded from above,

$$\lim_{k \to \infty} \|\theta^{k+1} - \theta^k\| = 0. \tag{22}$$

Sketch of proof. The fact that $\{\theta^k\}_{k \in \mathbb{N}}$ is bounded is straightforward from Proposition 2
and Assumption 1. Equations (21) and (22) can be shown using Lemma 1 and standard arguments on sequences
(see Celeux et al. 1999 for details). □

The proof of the following theorem is in Appendix A.

Theorem 1 Every accumulation point $\theta^*$ of the sequence $\{\theta^k\}_{k \in \mathbb{N}}$ satisfies one of
the following two properties:

• $\Lambda(\theta^* \mid \mathbf{y}) = +\infty$,

• $\theta^*$ is a stationary point of the modified log-likelihood function $\Lambda(\theta \mid \mathbf{y})$.

The following result is a direct consequence of Corollary 4.5 in Chretien and Hero (1999).

Corollary 1 Assume that the modified log-likelihood function $\Lambda(\theta \mid \mathbf{y})$ is strictly
concave in an open neighborhood of a stationary point of $\{\theta^k\}_{k \in \mathbb{N}}$. Then, the sequence
$\{\theta^k\}_{k \in \mathbb{N}}$ converges and its limit is a local maximizer of $\Lambda(\theta \mid \mathbf{y})$.

The main convergence result for the CEMM procedure is stated below and its proof is given in Appendix
B.

Theorem 2 Every accumulation point of the sequence $\{\theta^k\}_{k \in \mathbb{N}}$ is a stationary point of the
log-likelihood function $L(\theta \mid \mathbf{y})$ on the set defined by the constraint $\sum_{\ell=1}^{J} p_\ell = 1$.

4 Numerical experiments 

The behaviors of EM, SAGE (as described in the Appendix of Celeux et al. 1999) and CEMM are
compared on the basis of simulation experiments on univariate Gaussian mixtures with $J = 3$ components.
First, we consider a mixture of well-separated components with equal mixing proportions
$p_1 = p_2 = p_3 = 1/3$, means $\mu_1 = 0$, $\mu_2 = 3$, $\mu_3 = 6$ and equal variances
$\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = 1$. We will refer to this mixture as the well-separated mixture.
Secondly, we consider a mixture of overlapping components with equal mixing proportions
$p_1 = p_2 = p_3 = 1/3$, means $\mu_1 = 0$, $\mu_2 = 3$, $\mu_3 = 3$ and variances
$\sigma_1^2 = \sigma_2^2 = 1$, $\sigma_3^2 = 4$. This mixture will be referred to as the overlapping mixture.

For the well-separated mixture we consider a unique sample of size $n = 300$ and perform the EM, SAGE
and CEMM algorithms from the following initial position:

$$p_1^0 = p_2^0 = p_3^0 = 1/3, \quad \mu_1^0 = \bar{x} - s, \ \mu_2^0 = \bar{x}, \ \mu_3^0 = \bar{x} + s, \quad (\sigma_1^0)^2 = (\sigma_2^0)^2 = (\sigma_3^0)^2 = s^2,$$

where $\bar{x}$ and $s^2$ are respectively the empirical sample mean and variance. Starting from this rather
favorable initial position, close to the true parameter values, the three algorithms converge to the same
solution below:

$\hat{p}_1 = 0.36$, $\hat{\mu}_1 = 0.00$, $\hat{\sigma}_1^2 = 1.10$
$\hat{p}_2 = 0.29$, $\hat{\mu}_2 = 2.96$, $\hat{\sigma}_2^2 = 0.38$
$\hat{p}_3 = 0.35$, $\hat{\mu}_3 = 5.90$, $\hat{\sigma}_3^2 = 1.10$
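
The simulation setup described above can be sketched as follows. The seed, the function names and the choice of divisor $n$ for the empirical variance are assumptions made for illustration; the paper does not specify them, so this sketch will not reproduce the reported estimates.

```python
import math
import random


def sample_mixture(n, props, means, variances, rng):
    """Draw n observations from a univariate Gaussian mixture."""
    components = range(len(props))
    sample = []
    for _ in range(n):
        j = rng.choices(components, weights=props)[0]  # latent label z_i
        sample.append(rng.gauss(means[j], math.sqrt(variances[j])))
    return sample


def moment_based_start(ys):
    """Initial position built from the sample mean and variance (J = 3)."""
    n = len(ys)
    xbar = sum(ys) / n
    s2 = sum((y - xbar) ** 2 for y in ys) / n  # assumption: divisor n, not n - 1
    s = math.sqrt(s2)
    return [1 / 3] * 3, [xbar - s, xbar, xbar + s], [s2] * 3


rng = random.Random(0)  # hypothetical seed; none is reported in the paper
well_separated = sample_mixture(300, [1 / 3] * 3, [0.0, 3.0, 6.0], [1.0, 1.0, 1.0], rng)
overlapping = sample_mixture(300, [1 / 3] * 3, [0.0, 3.0, 3.0], [1.0, 1.0, 4.0], rng)
props0, means0, vars0 = moment_based_start(well_separated)
print(props0, means0, vars0)
```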

The performances of EM, SAGE and CEMM, in terms of speed, are compared on the basis of the number
of cycles needed to reach the stationary value of the constrained log-likelihood.



Figure 1 about here 



A cycle corresponds to the updating of all mixture components. For EM, it consists of an E-step (7) and
an M-step (8). For SAGE, it is the $(J+1)$ iterations described in the Appendix. For CEMM, it consists
of the $J$ iterations described in (11) and (12). In each case, a cycle of iterations requires the same number
of algebraic operations, namely, $J$ updates of the mixing proportions, means and variance matrices and
$J \times n$ updates of the conditional probabilities $t_{ij}(\theta)$.

Figure 1 displays the log-likelihood versus cycle for EM, SAGE and CEMM in the well-separated mixture
case. As expected, when starting from a good initial position in a well-separated mixture situation, EM
converges rapidly to a local maximum of the likelihood. Moreover, EM outperforms SAGE and CEMM in
this example.

For the overlapping mixture, we consider two different samples of size $n = 300$ and perform the EM,
SAGE and CEMM algorithms from the following initial position:

$$p_1^0 = p_2^0 = p_3^0 = 1/3, \quad \mu_1^0 = 0.0, \ \mu_2^0 = 0.1, \ \mu_3^0 = 0.2, \quad (\sigma_1^0)^2 = (\sigma_2^0)^2 = (\sigma_3^0)^2 = 1.0,$$

which is far from the true parameter values. For the first sample, the three algorithms converge to the
same solution:

$\hat{p}_1 = 0.65$, $\hat{\mu}_1 = 0.85$, $\hat{\sigma}_1^2 = 1.28$
$\hat{p}_2 = 0.19$, $\hat{\mu}_2 = 3.32$, $\hat{\sigma}_2^2 = 0.26$
$\hat{p}_3 = 0.16$, $\hat{\mu}_3 = 5.67$, $\hat{\sigma}_3^2 = 2.10$



Figures 2 and 3 about here 



Figure 2 displays the log-likelihood versus cycle for EM, SAGE and CEMM for the first sample of the
overlapping mixture. In this situation, EM appears to converge slowly, so that SAGE and especially
CEMM show a significant improvement in convergence speed.

For the second sample, starting from the same position, SAGE and CEMM both converge to the solution
below:

$\hat{p}_1 = 0.61$, $\hat{\mu}_1 = 0.85$, $\hat{\sigma}_1^2 = 1.62$
$\hat{p}_2 = 0.13$, $\hat{\mu}_2 = 3.00$, $\hat{\sigma}_2^2 = 0.52$
$\hat{p}_3 = 0.26$, $\hat{\mu}_3 = 4.27$, $\hat{\sigma}_3^2 = 4.29$

while EM proposes the following solution, after 1000 cycles:

$\hat{p}_1 = 0.61$, $\hat{\mu}_1 = 0.83$, $\hat{\sigma}_1^2 = 1.60$
$\hat{p}_2 = 0.16$, $\hat{\mu}_2 = 2.98$, $\hat{\sigma}_2^2 = 0.62$
$\hat{p}_3 = 0.22$, $\hat{\mu}_3 = 4.58$, $\hat{\sigma}_3^2 = 4.29$




Figure 3 displays the log-likelihood versus cycle for EM, SAGE and CEMM for the second sample of the
overlapping mixture. The same conclusions hold for this sample. The CEMM algorithm is the fastest
while EM is really slow, the corresponding local maximum of the likelihood not being reached after 1000
iterations.

Moreover, it appears that the implemented version of the SAGE algorithm is less beneficial than CEMM 
for situations in which EM converges slowly. A possible reason for this behavior of SAGE is that the 
(J + l)th iteration of SAGE involves the whole complete data structure, whereas CEMM iterations never 
need the whole complete data space as missing data space. 

5 Concluding remarks 

We presented a component-wise EM algorithm for finite identifiable mixtures of distributions (CEMM)
and proved convergence properties similar to those of standard EM. As illustrated in Section 4, numerical
experiments suggest that CEMM and EM have complementary performances. The CEMM algorithm is
of little interest when EM converges quickly but shows significant improvement when EM converges
slowly. Thus, CEMM may be useful in many contexts. An intuitive explanation of the performance of our
procedure is that the component-wise strategy prevents the algorithm from staying too long
at critical points (typically saddle points) where standard EM is likely to get trapped. More theoretical
investigations would be interesting but are beyond the scope of the present paper.

Other future directions of research include the use of relaxation, as in Chretien and Hero (1998b), for
accelerating CEMM, and the possibility of using varying or adaptive orders to update the components.

A Proof of Theorem 1



B Proof of Theorem 2 

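Let $\theta^*$ be an accumulation point of $\{\theta^k\}_{k \in \mathbb{N}}$. Note that $\theta^*$ lies in
$\Theta' = \{\theta \in \Theta \mid \sum_{\ell=1}^{J} p_\ell = 1\}$. Take any vector $\delta$ such that
$\theta^* + \delta$ lies in $\Theta'$. Since $\Theta'$ is affine, any point $\theta_t = \theta^* + t\delta$,
$t \in \mathbb{R}$, also lies in $\Theta'$. The directional derivative of $\Lambda$ at $\theta^*$ in the
direction $\delta$ is null by Theorem 1. It is given by

$$0 = \Lambda'(\theta^*; \delta \mid \mathbf{y}) = \lim_{t \to 0^+} \frac{\Lambda(\theta^* + t\delta \mid \mathbf{y}) - \Lambda(\theta^* \mid \mathbf{y})}{t},$$

which is equal to

$$\Lambda'(\theta^*; \delta \mid \mathbf{y}) = \lim_{t \to 0^+} \frac{L(\theta^* + t\delta \mid \mathbf{y}) - L(\theta^* \mid \mathbf{y}) - c(\theta^* + t\delta) + c(\theta^*)}{t},$$

where $c(\theta) = n\big(\sum_{\ell=1}^{J} p_\ell - 1\big)$. Since $\theta^* + t\delta$ lies in $\Theta'$ for all
nonnegative $t$, $c(\theta^* + t\delta) = c(\theta^*) = 0$, and we obtain

$$\Lambda'(\theta^*; \delta \mid \mathbf{y}) = L'(\theta^*; \delta \mid \mathbf{y}).$$

Thus,

$$L'(\theta^*; \delta \mid \mathbf{y}) = 0. \tag{23}$$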

Figure 1: Comparison of log-likelihood versus cycle for EM (full line), SAGE (dashed line) and CEMM
(dotted line) in the well-separated mixture case. 



Figure 2: Comparison of log-likelihood versus cycle for EM (full line), SAGE (dashed line) and CEMM 
(dotted line) in the overlapping mixture case (first sample). 

Figure 3: Comparison of log-likelihood versus cycle for EM (full line), SAGE (dashed line) and CEMM
(dotted line) in the overlapping mixture case (second sample). 